 medical ai


One Patient, Many Contexts: Scaling Medical AI with Contextual Intelligence

Li, Michelle M., Reis, Ben Y., Rodman, Adam, Cai, Tianxi, Dagan, Noa, Balicer, Ran D., Loscalzo, Joseph, Kohane, Isaac S., Zitnik, Marinka

arXiv.org Artificial Intelligence

Medical AI, including clinical language models, vision-language models, and multimodal health record models, already summarizes notes, answers questions, and supports decisions. Adapting these models to new populations, specialties, or care settings often relies on fine-tuning, prompting, or retrieval from external knowledge bases. These strategies can scale poorly and risk contextual errors: outputs that appear plausible but miss critical patient or situational information. We envision context switching as a solution. Context switching adjusts model reasoning at inference without retraining. Generative models can tailor outputs to patient biology, care setting, or disease. Multimodal models can reason on notes, laboratory results, imaging, and genomics, even when some data are missing or delayed. Agent models can coordinate tools and roles based on tasks and users. In each case, context switching enables medical AI to adapt across specialties, populations, and geographies. It requires advances in data design, model architectures, and evaluation frameworks, and establishes a foundation for medical AI that scales to infinitely many contexts while remaining reliable and suited to real-world care.


An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine

Fard, Pedram, Azhir, Alaleh, Rezaii, Neguine, Tian, Jiazi, Estiri, Hossein

arXiv.org Artificial Intelligence

Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi-agent ecosystem for N-of-1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low-density regions, calibration in the small, and risk-coverage trade-offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.
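The risk-coverage trade-off mentioned in this abstract can be illustrated with a small sketch. It assumes the usual selective-prediction reading: predictions are ranked by confidence, and risk is the error rate among the most confident fraction that is kept. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
def risk_coverage_curve(confidences, errors):
    """Return (coverage, risk) pairs for a selective predictor.

    confidences: one confidence score per prediction (higher = more sure).
    errors: 1 if the corresponding prediction was wrong, else 0.
    Assumption: the system answers its most confident cases first.
    """
    # Rank predictions from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve = []
    cumulative_error = 0.0
    for k, idx in enumerate(order, start=1):
        cumulative_error += errors[idx]
        coverage = k / len(order)      # fraction of cases answered
        risk = cumulative_error / k    # error rate among answered cases
        curve.append((coverage, risk))
    return curve


# Hypothetical toy data: four predictions, one error at low confidence.
confidences = [0.95, 0.90, 0.60, 0.55]
errors = [0, 0, 1, 0]
curve = risk_coverage_curve(confidences, errors)
```

Reading the curve from left to right shows how much error is accepted as the system abstains on fewer cases.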


A global log for medical AI

Noori, Ayush, Rodman, Adam, Karthikesalingam, Alan, Mateen, Bilal A., Longhurst, Christopher A., Yang, Daniel, deBronkart, Dave, Galea, Gauden, Wolf, Harold F. III, Waxman, Jacob, Mandel, Joshua C., Rotich, Juliana, Mandl, Kenneth D., Mustafa, Maryam, Miles, Melissa, Shah, Nigam H., Lee, Peter, Korom, Robert, Mahoney, Scott, Hain, Seth, Wong, Tien Yin, Mundel, Trevor, Natarajan, Vivek, Dagan, Noa, Clifton, David A., Balicer, Ran D., Kohane, Isaac S., Zitnik, Marinka

arXiv.org Artificial Intelligence

Modern computer systems often rely on syslog, a simple, universal protocol that records every critical event across heterogeneous infrastructure. However, healthcare's rapidly growing clinical AI stack has no equivalent. As hospitals rush to pilot large language models and other AI-based clinical decision support tools, we still lack a standard way to record how, when, by whom, and for whom these AI models are used. Without that transparency and visibility, it is challenging to measure real-world performance and outcomes, detect adverse events, or correct bias or dataset drift. In the spirit of syslog, we introduce MedLog, a protocol for event-level logging of clinical AI. Any time an AI model is invoked to interact with a human, interface with another algorithm, or act independently, a MedLog record is created. This record consists of nine core fields: header, model, user, target, inputs, artifacts, outputs, outcomes, and feedback, providing a structured and consistent record of model activity. To encourage early adoption, especially in low-resource settings, and minimize the data footprint, MedLog supports risk-based sampling, lifecycle-aware retention policies, and write-behind caching; detailed traces for complex, agentic, or multi-stage workflows can also be captured under MedLog. MedLog can catalyze the development of new databases and software to store and analyze MedLog records. Realizing this vision would enable continuous surveillance, auditing, and iterative improvement of medical AI, laying the foundation for a new form of digital epidemiology.
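As a rough illustration of the record structure described in this abstract, the sketch below models the nine core fields as a Python dataclass and adds a toy risk-based sampling rule. Only the nine field names come from the abstract. The field contents, the `should_log` helper, and the sampling rates are hypothetical assumptions, not part of the MedLog protocol.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class MedLogRecord:
    # The nine core fields named in the abstract. The shape of each
    # field is an assumption; the abstract does not define it.
    header: dict[str, Any]
    model: dict[str, Any]
    user: dict[str, Any]
    target: dict[str, Any]
    inputs: dict[str, Any]
    artifacts: list[Any] = field(default_factory=list)
    outputs: dict[str, Any] = field(default_factory=dict)
    outcomes: dict[str, Any] = field(default_factory=dict)
    feedback: dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the record as one JSON object per event."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sampling rates: keep every high-risk event, but only
# every second medium-risk and every tenth low-risk event.
SAMPLE_EVERY = {"high": 1, "medium": 2, "low": 10}


def should_log(risk_level: str, event_index: int) -> bool:
    """Toy deterministic stand-in for risk-based sampling."""
    return event_index % SAMPLE_EVERY[risk_level] == 0


# Hypothetical example event; all values are made up for illustration.
record = MedLogRecord(
    header={"event_id": "example-0001"},
    model={"name": "example-model", "version": "0.1"},
    user={"role": "clinician"},
    target={"patient_ref": "pseudonymized"},
    inputs={"prompt": "Summarize the discharge note."},
    outputs={"text": "Summary goes here."},
)
```

Calling `record.to_json()` yields a single line that could be appended to a log file, in the spirit of syslog.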


Position Paper: Integrating Explainability and Uncertainty Estimation in Medical AI

Fan, Xiuyi

arXiv.org Artificial Intelligence

Uncertainty is a fundamental challenge in medical practice, but current medical AI systems fail to explicitly quantify or communicate uncertainty in a way that aligns with clinical reasoning. Existing XAI works focus on interpreting model predictions but do not capture the confidence or reliability of these predictions. Conversely, uncertainty estimation (UE) techniques provide confidence measures but lack intuitive explanations. The disconnect between these two areas limits AI adoption in medicine. To address this gap, we propose Explainable Uncertainty Estimation (XUE) that integrates explainability with uncertainty quantification to enhance trust and usability in medical AI. We systematically map medical uncertainty to AI uncertainty concepts and identify key challenges in implementing XUE. We outline technical directions for advancing XUE, including multimodal uncertainty quantification, model-agnostic visualization techniques, and uncertainty-aware decision support systems. Lastly, we propose guiding principles to ensure effective XUE realisation. Our analysis highlights the need for AI systems that not only generate reliable predictions but also articulate confidence levels in a clinically meaningful way. This work contributes to the development of trustworthy medical AI by bridging explainability and uncertainty, paving the way for AI systems that are aligned with real-world clinical complexities.


In defence of post-hoc explanations in medical AI

Hatherley, Joshua, Munch, Lauritz, Bjerring, Jens Christian

arXiv.org Artificial Intelligence

Since the early days of the Explainable AI movement, post-hoc explanations have been praised for their potential to improve user understanding, promote trust, and reduce patient safety risks in black box medical AI systems. Recently, however, critics have argued that the benefits of post-hoc explanations are greatly exaggerated since they merely approximate, rather than replicate, the actual reasoning processes that black box systems take to arrive at their outputs. In this article, we aim to defend the value of post-hoc explanations against this recent critique. We argue that even if post-hoc explanations do not replicate the exact reasoning processes of black box systems, they can still improve users' functional understanding of black box systems, increase the accuracy of clinician-AI teams, and assist clinicians in justifying their AI-informed decisions. While post-hoc explanations are not a "silver bullet" solution to the black box problem in medical AI, we conclude that they remain a useful strategy for addressing the black box problem in medical AI.


Limits of trust in medical AI

Hatherley, Joshua

arXiv.org Artificial Intelligence

This is a pre-print version of an article published as: Hatherley, Joshua. Please cite that version. Abstract: Artificial intelligence (AI) is expected to revolutionise the practice of medicine. Recent advancements in the field of deep learning have demonstrated success in a variety of clinical tasks: detecting diabetic retinopathy from images, predicting hospital readmissions, aiding in the discovery of new drugs, etc. AI's progress in medicine, however, has led to concerns regarding the potential effects of this technology upon relationships of trust in clinical practice. In this paper, I will argue that there is merit to these concerns, since AI systems can be relied upon, and are capable of reliability, but cannot be trusted, and are not capable of trustworthiness. Insofar as patients are required to rely upon AI systems for their medical decision-making, there is potential for this to produce a deficit of trust in relationships in clinical practice.


Towards a perturbation-based explanation for medical AI as differentiable programs

Abe, Takeshi, Asai, Yoshiyuki

arXiv.org Machine Learning

Recent advances in machine learning algorithms have reached a point where medical devices can be equipped with artificial intelligence (AI) models for diagnostic support and routine automation in clinical settings. In medicine and healthcare, there is a particular demand for sufficient and objective explainability of the outcome generated by AI models. However, AI models are generally considered black boxes due to their complexity, and the computational process leading to their response is often opaque. Although several methods have been proposed to explain the behavior of models by evaluating the importance of each feature in discrimination and prediction, they may suffer from biases and opacities arising from the scale and sampling protocol of the dataset used for training or testing. To overcome the shortcomings of existing methods, we explore an alternative approach to provide an objective explanation of AI models that can be defined independently of the learning process and does not require additional data. As a preliminary study for this direction of research, this work examines the numerical availability of the Jacobian matrix of deep learning models, which measures how stably a model responds to small perturbations added to the input. The indicator, if available, is calculated from a trained AI model for a given target input. This is a first step towards a perturbation-based explanation, which will assist medical practitioners in understanding and interpreting the response of the AI model in its clinical application.
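The idea of using the Jacobian to measure how stably a model responds to small input perturbations can be sketched as follows. The abstract refers to differentiable programs, which suggests automatic differentiation. This sketch instead swaps in central finite differences so that it runs with the standard library alone. The `toy_model` function is a hypothetical stand-in for a trained network, and summarizing the Jacobian by its Frobenius norm is an assumption, not necessarily the indicator the authors compute.

```python
import math


def toy_model(x):
    """Hypothetical two-output model standing in for a trained network."""
    return [math.tanh(x[0] + 2.0 * x[1]), x[0] * x[1]]


def jacobian_fd(f, x, eps=1e-6):
    """Central-difference estimate of J[i][j] = d f_i / d x_j at x."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        # Perturb one input coordinate at a time in both directions.
        x_plus = list(x)
        x_plus[j] += eps
        x_minus = list(x)
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def frobenius_norm(jac):
    """Single-number summary: larger means more sensitive to perturbation."""
    return math.sqrt(sum(v * v for row in jac for v in row))


# Evaluate the local sensitivity at one target input.
x0 = [0.5, -0.25]
J = jacobian_fd(toy_model, x0)
sensitivity = frobenius_norm(J)
```

Because the Jacobian is evaluated at a single target input, it needs no additional data beyond the trained model, which matches the motivation stated in the abstract.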


Generalization in medical AI: a perspective on developing scalable models

Behar, Joachim A., Levy, Jeremy, Celi, Leo Anthony

arXiv.org Artificial Intelligence

Over the past few years, research has witnessed the advancement of deep learning models trained on large datasets, some even encompassing millions of examples. While these models achieve impressive performance on their hidden test sets, they often underperform when assessed on external datasets. Recognizing the critical role of generalization in medical AI development, many prestigious journals now require reporting results both on the local hidden test set as well as on external datasets before considering a study for publication. Effectively, the field of medical AI has transitioned from the traditional usage of a single dataset that is split into train and test to a more comprehensive framework using multiple datasets, some of which are used for model development (source domain) and others for testing (target domains). However, this new experimental setting does not necessarily resolve the challenge of generalization. This is because of the variability encountered in intended use and specificities across hospital cultures, which makes the idea of universally generalizable systems a myth. On the other hand, the systematic, and a fortiori recurrent, re-calibration of models at the individual hospital level, although ideal, may be overoptimistic given the legal, regulatory and technical challenges that are involved. Re-calibration using transfer learning may not even be possible in some instances where reference labels of target domains are not available. In this perspective we establish a hierarchical three-level scale system reflecting the generalization level of a medical AI algorithm. This scale better reflects the diversity of real-world medical scenarios in which target domain data for re-calibration of models may or may not be available and, if they are, may or may not have reference labels systematically available.


California reparations panel warns of 'racially biased' medical AI, calls for legislative action

FOX News

Doctors believe Artificial Intelligence is now saving lives, after a major advancement in breast cancer screenings. A.I. is detecting early signs of the disease, in some cases years before doctors would find the cancer on a traditional scan. California's reparations task force is recommending as part of its set of proposals to make amends for slavery and anti-Black racism that state lawmakers address what it calls "racially biased" artificial intelligence used in health care. The task force, created by state legislation signed by Gov. Gavin Newsom in 2020, formally approved last weekend its final recommendations to the California Legislature, which will decide whether to enact the measures and send them to the governor's desk to be signed into law. The recommendations include several proposals related to health care, including some concerning medical artificial intelligence (AI), which the task force describes as "racially biased" and contributing to alleged systemic racism against Black Californians.


The promise--and pitfalls--of medical AI headed our way

#artificialintelligence

A patient is lying on the operating table as the surgical team reaches an impasse. They can't find the intestinal rupture. A surgeon asks aloud, "Check whether we missed a view of any intestinal section in the visual feed of the last 15 minutes." An artificial intelligence medical assistant gets to work reviewing the patient's past scans and highlighting video streams of the procedure in real time. It alerts the team when they've skipped a step in the procedure and reads out relevant medical literature when surgeons encounter a rare anatomical phenomenon.